This tutorial will walk you through the entire data science pipeline, starting with data collection and processing, then moving on to exploratory data analysis and data visualization. Next, we will use hypothesis testing and machine learning to analyze the data. Lastly, we will summarize the insights learned during the tutorial. This tutorial focuses mainly on data processing and on analysis through visualizations created with the Pyplot and Plotly libraries.

The data set we will analyze is the Homicide Reports (1980-2014) from the FBI and FOIA, which can be downloaded here. We chose this data because it contains many variables, allowing us to carry out a variety of analyses from different angles. In addition, by analyzing the homicide reports and looking at the number of cases, we hope to find trends and become more aware of how serious the problem can be.
import pandas as pd
import numpy as np
data = pd.read_csv("database.csv", low_memory=False)
df = pd.DataFrame(data)
#replace 'Unknown' values with np.nan
df.replace('Unknown', np.nan, inplace=True)
df.head()
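As a quick sanity check after the replacement, we can count how many values became missing. A minimal sketch with toy data (the column names here are illustrative, not the full dataset):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the homicide data
toy = pd.DataFrame({'Perpetrator Sex': ['Male', 'Unknown', 'Female'],
                    'Weapon': ['Handgun', 'Knife', 'Unknown']})
toy.replace('Unknown', np.nan, inplace=True)

# total number of missing cells after the replacement
missing = toy.isna().sum().sum()
print(missing)
```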
For the first plot, we are interested in seeing the number of cases each year from 1980 to 2014. Instead of summing the Incident column directly, I used the size() method of groupby to count the number of rows in each group, which may be a little more accurate since two incidents can involve the same victim and perpetrator.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
g = df.groupby('Year')
years = sorted(g.groups.keys())
size = g.size().values.ravel()
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='-', ms=5, color = 'purple', alpha = .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")
From the plot above we can immediately see a huge decline in the number of homicides in the late 1990s. Unfortunately, after some research there still seems to be no definite answer for the cause of the decline. However, some articles discuss possible explanations. The links are provided below:
We can use sklearn to fit a linear regression model to the data in the above graph.
from sklearn import linear_model
regr = linear_model.LinearRegression()
#fit the regression model
x = years
y = size
x = np.reshape(x,(-1,1))
y = y.reshape(-1,1)
regr.fit(x, y)
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='None', ms=5, color = 'purple', alpha = .5)
ax.plot(years, regr.predict(x).ravel(), color='blue', alpha= .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")
We can see from the regression line that even though the number of homicides climbed in the 1980s and early 2000s, the overall trend is downward as the years pass.
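The direction of the trend can also be read directly off the fitted model: the slope coefficient is the average change in cases per year. A minimal sketch with synthetic counts (not the real series):

```python
import numpy as np
from sklearn import linear_model

# synthetic yearly counts with a built-in downward trend of 150 cases/year
years = np.arange(1980, 2015).reshape(-1, 1)
counts = 22000.0 - 150.0 * (years.ravel() - 1980)

regr = linear_model.LinearRegression()
regr.fit(years, counts)

slope = regr.coef_[0]  # average change in cases per year; negative = declining
print(slope)
```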
For the next plot, we would like to see the relationship between the victim and the perpetrator. The "Other" category includes some of the closer relationships, which are:
Neighbor
Boyfriend/Girlfriend
Friend
Family
Common-Law Husband
Common-Law Wife
Stepdaughter
Stepfather
Stepmother
Stepson
Ex-Husband
Ex-Wife
Employee
Employer
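As an aside, the three-column table built by the counting loop below can also be produced more compactly by relabeling every relationship outside Stranger/Acquaintance as 'Other' before grouping. A sketch on toy data (not the author's loop):

```python
import pandas as pd

# toy rows standing in for the homicide data
toy = pd.DataFrame({
    'Year': [1980, 1980, 1980, 1981, 1981],
    'Relationship': ['Stranger', 'Friend', 'Acquaintance', 'Stranger', 'Wife'],
})

# keep Stranger/Acquaintance, relabel everything else as 'Other'
rel = toy['Relationship'].where(
    toy['Relationship'].isin(['Stranger', 'Acquaintance']), 'Other')

# one row per year, one column per category, zeros where a group is absent
table = toy.groupby([toy['Year'], rel]).size().unstack(fill_value=0)
print(table)
```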
g1 = df.groupby(['Year','Relationship'])
g1 = g1.size()
#Create a new dataframe with the year as index and three columns indicating the number of 'Stranger', 'Acquaintance', and 'Other' cases
df2 = pd.DataFrame(index = years, columns = ['Stranger', 'Acquaintance', 'Other'])
stranger = []
acq= []
other = []
y = 1980
s = 0
#counting the total number of specific relationship for each year
for index, series in g1.iteritems():
if(index[1] == 'Stranger'):
stranger.append(series)
elif (index[1] == 'Acquaintance'):
acq.append(series)
else:
s += series
if(y == index[0]-1):
y += 1
other.append(s)
s = 0
other.append(s)
df2['Stranger'] = stranger
df2['Acquaintance'] = acq
df2['Other'] = other
f, ax1 = plt.subplots(1, figsize=(20,6))
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Homicide')
ax1.set_title("Relationship Between Victim and Perpetrator")
df2.plot.bar(stacked=True,ax=ax1, alpha = .5, width = .8, color =['#F4561D','#F1911E','#F1BD1A'])
plt.show()
From the above bar graph, we can see a decrease in the number of homicides in all three categories of relationships we analyzed. However, while the numbers of cases start off fairly close for stranger and acquaintance, the decreasing trend for acquaintance is more obvious than for stranger. We can also see that the number of homicides between the closer relationships marked 'Other' did not decrease much compared to the other two; it even becomes very close to the acquaintance number in the 2000s.
After seeing the number of homicides by the relationship between victim and perpetrator, we might also want to know the sex of the victim and perpetrator, so we have included graphs of the number of homicide victims and perpetrators by sex.
g2 = df.groupby(['Year','Victim Sex'])
#reshape the pandas series into one column for the number of female victims and one for male victims
g2 = g2.size().values.reshape(35,2)
df3 = pd.DataFrame(index = years, columns =['#Female Victim','#Male Victim'], data=g2)
f, ax2 = plt.subplots(1, figsize=(20,6))
ax2.set_xlabel('Year')
ax2.set_ylabel('Number of Victim')
ax2.set_title("Sex of Homicide Victim")
df3.plot.bar(ax=ax2, color=['r','b'],alpha=0.5, width=0.8)
g3 = df.groupby(['Year','Perpetrator Sex'])
g3 = g3.size().values.reshape(35,2)
df4 = pd.DataFrame(index = years, columns =['#Female Perpetrator','#Male Perpetrator'], data=g3)
f, ax3 = plt.subplots(1, figsize=(20,6))
ax3.set_xlabel('Year')
ax3.set_ylabel('Number of Perpetrator')
ax3.set_title("Sex of Homicide Perpetrator")
df4.plot.bar(ax=ax3, color=['r','b'],alpha=0.5, width=0.8)
plt.show()
It might not be very surprising that the male numbers are much higher than the female numbers, but it is interesting that the trends for victims and perpetrators look almost identical. From the resulting graphs, we also noticed that the perpetrator counts are lower than the victim counts, because perpetrator information is missing for unsolved cases. So, for the next plot we will show the percentage of homicides solved each year.
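One caveat on the reshape(35, 2) trick used above: it silently assumes both sexes appear in every year. unstack makes the same pivot without that assumption — a sketch on toy data:

```python
import pandas as pd

# toy rows standing in for the homicide data
toy = pd.DataFrame({'Year': [1980, 1980, 1981, 1981, 1981],
                    'Victim Sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# pivot the inner grouping level into columns; absent groups become 0
counts = toy.groupby(['Year', 'Victim Sex']).size().unstack(fill_value=0)
print(counts)
```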
g4 = df.groupby(['Year','Crime Solved'])
#reshape the pandas series into one column for the number of unsolved cases and one for solved cases
g4 = g4.size().values.reshape(35,2)
df5 = pd.DataFrame(index = years, columns =['#Not Solved','#Solved'], data=g4)
#calculate the crime solve percentage
df5['Crime Solved %'] = (df5['#Solved']/(df5['#Solved']+df5['#Not Solved'])*100)
x = years
y = df5['Crime Solved %']
x = np.reshape(x, (-1,1))
y = y.values.reshape(-1,1)
regr = linear_model.LinearRegression()
#fit the regression model
regr.fit(x, y)
fig, ax5 = plt.subplots()
ax5.plot(df5.index, df5['Crime Solved %'], marker='.', linestyle='None', ms=5, color = 'orange')
start, end = ax5.get_xlim()
ax5.xaxis.set_ticks(np.arange(start, end, 1))
ax5.set_xlabel('Year')
ax5.set_ylabel('Percentage of Homicide Solved')
ax5.set_title("1980-2014 Percentage of Homicide Solved by Year")
# plot the regression line
ax5.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)
plt.xticks(rotation=90)
plt.show()
The result was a little unexpected for me. At first, I thought we would see a clear linear increase in the percentage of solved homicide cases, because of improvements in technology and because we can learn from previous experience. However, the regression line shows that the percentage of solved cases is actually decreasing over time. The reason could be that perpetrators also have knowledge and technology that make the cases harder to solve.
The choropleth map is another powerful technique that provides strong visualization. Below we will use a choropleth map to show the total number of homicide cases from 1980 to 2014.
g5 = df.groupby('State')
g5 = g5.size()
# since we cannot show DC on the 50-state map, we add DC's count to Maryland's, since DC is located within Maryland
maryland_total = g5.get('District of Columbia') + g5.get('Maryland')
g5.set_value('Maryland',maryland_total)
# We need to translate the state names into state codes for the plotly map to process the data
code = ['AL','AK', 'AZ', 'AR','CA', 'CO','CT','DE','DC','FL',
'GA', 'HI', 'ID','IL','IN','IA', 'KS','KY', 'LA', 'ME', 'MD','MA','MI',
'MN', 'MS', 'MO','MT', 'NE', 'NV', 'NH','NJ','NM','NY', 'NC', 'ND', 'OH',
'OK', 'OR','PA','RI','SC', 'SD','TN', 'TX','UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
g5.index = code
import plotly
#we must add this line to use plotly offline, so you do not need an account with plotly
plotly.offline.init_notebook_mode()
#we created the purple color scale to use
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = g5.index,
z = g5.values,
locationmode = 'USA-states',
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
) ),
colorbar = dict(
title = "Number of cases")
) ]
layout = dict(
title = '1980 - 2014 Number of Homicide by State',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
)
fig = dict( data=data, layout=layout )
plotly.offline.iplot( fig )
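Note that the hand-ordered code list above depends on the state names sorting in exactly that order. A mapping keyed by state name is less fragile; a sketch with a few states (the dictionary would need all 51 entries in practice):

```python
import pandas as pd

# abbreviated name-to-code mapping (illustrative; extend to all states)
state_codes = {'Alabama': 'AL', 'Alaska': 'AK', 'Maryland': 'MD'}

# toy counts standing in for the grouped homicide totals
counts = pd.Series({'Alabama': 100, 'Alaska': 20, 'Maryland': 300})

# look up each state's code by name instead of relying on ordering
locations = counts.index.map(state_codes)
print(list(locations))
```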
From the result, we can hypothesize that states with higher populations also have higher numbers of cases. To test our hypothesis, we also gathered state population data from the Census. The data was assembled by manually collecting each state's population for each year from 1980 to 2014 and putting it in an Excel document.
popdf = pd.read_excel("state_population.xlsx")
#add an extra column for the number of cases, which can be used later
popdf['Cases']= g5.values
#as with the previous data, we also want to add DC's data into MD and take the average population across years
m = popdf.loc[popdf['State']=='MD']
d = popdf.loc[popdf['State']=='DC']
t = m.values+d.values
t = np.delete(t,0)
t = np.delete(t,35)
a = np.mean(t)
popdf.set_value(20,'Average', a)
popdf.head()
plotly.offline.init_notebook_mode()
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = popdf['State'],
z = popdf['Average'],
locationmode = 'USA-states',
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
) ),
colorbar = dict(
title = "Population")
) ]
layout = dict(
title = '1980 - 2014 Average Population by State',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
)
fig = dict( data=data, layout=layout )
plotly.offline.iplot( fig )
The result from the average population data agrees with our hypothesis and looks almost identical to the previous choropleth map.
To get a better view of the relationship between the number of homicides by state and the average population by state, we can use the linear model to draw a regression line again.
fig, ax6 = plt.subplots()
ax6.plot(popdf.Cases,popdf.Average, linestyle='None', marker='.' )
x = popdf.Cases
y = popdf.Average
x = x.values.reshape(-1,1)
y = y.values.reshape(-1,1)
regr = linear_model.LinearRegression()
regr.fit(x, y)
ax6.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)
#remove the automatic offset and scientific notation for large numbers
ax6.ticklabel_format(useOffset=False, style='plain')
ax6.set_xlabel('Number of Homicide')
ax6.set_ylabel('Average State Population')
ax6.set_title("1980-2014 Total Number of Homicide vs. States Average Population")
plt.show()
The result clearly shows a positive relationship between the number of homicides and the population: as the population increases, the number of homicide cases also increases.
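The strength of that positive relationship can be summarized with the Pearson correlation coefficient. A minimal sketch on toy case/population pairs (not the real columns):

```python
import numpy as np

# toy stand-ins for popdf.Cases and popdf.Average, roughly proportional
cases = np.array([100, 400, 900, 1600])
population = np.array([1.0e6, 4.1e6, 8.9e6, 16.2e6])

# Pearson correlation coefficient; close to 1 means a strong linear relation
r = np.corrcoef(cases, population)[0, 1]
print(r)
```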
import queue
dfState = df.groupby(['State','Year']).size()
#find and fill missing data using queue to check the years range for each state
years = queue.Queue()
#range is inclusive for the start values and exclusive for the end value
for j in range(1980,2015):
years.put(j)
#iterate over the rows of the series, find the years each state is missing, and add np.nan for them
for i, row in dfState.iteritems():
if(years.empty()):
for j in range(1980,2015):
years.put(j)
y = years.get()
if(type(i) != int):
if(i[1] != y):
for x in range(y, i[1]):
dfState.loc[(i[0],x)] = np.nan
y = years.get()
#transfer the pandas series to a data frame
dataState = dfState.to_frame('Crime')
#making year and state columns
dataState = dataState.reset_index()
#sort dataframe first by year then by state
dataState.sort_values(by=['Year', 'State'], inplace=True)
#now we want to add the population data to our dataframe
#drop the unused columns
temp_pop = popdf.drop('Average',1)
temp_pop.drop('Cases', 1,inplace=True)
temp_pop.drop('State', 1,inplace=True)
temp_pop = temp_pop.transpose()
pop = temp_pop.as_matrix()
dataState['Population'] = pop.reshape(1785,1)
dataState['pop'] = dataState['Population']
#rearrange to use it in the bubble chart
dataState = dataState[['Year','pop','Crime','Population','State']]
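The queue-based gap filling above can also be expressed with reindex against the full (state, year) grid, which inserts NaN for every missing pair in one step. A sketch on toy data:

```python
import pandas as pd

# toy counts in which the year 1982 is missing for both states
s = pd.Series({('Alabama', 1980): 5, ('Alabama', 1981): 7,
               ('Alaska', 1980): 1, ('Alaska', 1981): 2})
s.index = pd.MultiIndex.from_tuples(s.index, names=['State', 'Year'])

# every (state, year) combination for the full range of years
full = pd.MultiIndex.from_product([['Alabama', 'Alaska'], range(1980, 1983)],
                                  names=['State', 'Year'])
s = s.reindex(full)  # missing pairs become NaN

print(int(s.isna().sum()))
```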
Finally, we can import motionchart and pass the dataframe to the motion chart. The bubble size will be determined by the size of the state population, the y-axis will be the population, the x-axis will be the number of crimes, and the chart animates over the years.
from motionchart.motionchart import MotionChart, MotionChartDemo
mChart = MotionChart(df=dataState)
mChart.to_browser()
Data:
Visualization Tool:
Others: